How to run benchmarks on rdflib stores

In this tutorial we will see the necessary steps to run benchmarks on rdflib stores. It is a follow-up to previous work on getting and analysing the results of some benchmarks.

This document will cover:

  1. Loading data into an rdflib store

  2. Running the benchmarks on the rdflib store

Loading data into an rdflib store

First we need to import rdflib


In [3]:
import rdflib

Here we choose to use Sleepycat as the rdflib store, so we declare the graph like this:


In [4]:
graph = rdflib.Graph(store='Sleepycat',
                     identifier='my_benchmark')

Now we need to actually open the store. On success, open() returns rdflib.store.VALID_STORE, whose value is the 1 shown below:


In [5]:
graph.open('my_sleepycat_store',
           create=True)  # set to False if you already have created the store


Out[5]:
1

We can check where we are (to know where the store will end up) with


In [6]:
!pwd


/home/vincent/projets/liris/ktbs_bench/bench_examples

The data we will put in the store was generated with SP2Bench. Archives of this data are in the data directory; they are named after the number of triples they contain.

We will do this tutorial with the graph in data/32000.n3.

If you haven't extracted the archives, you can do so in Python:


In [7]:
from bz2 import BZ2File
data = BZ2File('../data/32000.n3.bz2')

Otherwise, just declare the path to the n3 file by uncommenting the following:


In [8]:
# data = '../data/32000.n3'
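Either way, data ends up as something graph.parse() accepts: a file-like object or a plain path. A small helper (hypothetical, stdlib-only, not part of the tutorial's code) can pick the right one from the extension:

```python
from bz2 import BZ2File


def open_n3(path):
    """Return something graph.parse() can consume for a plain or
    bz2-compressed n3 file.

    Hypothetical helper: graph.parse() accepts a path or a file-like
    object, so this just normalises the two cases seen above.
    """
    if path.endswith('.bz2'):
        return BZ2File(path)       # decompress on the fly
    return open(path, 'rb')        # plain n3 file
```

With it, data = open_n3('../data/32000.n3.bz2') covers both cases.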

We put this graph into our store with (the %time prefix is optional):


In [9]:
%time graph.parse(data, format='n3')


CPU times: user 9.51 s, sys: 46.8 ms, total: 9.56 s
Wall time: 9.67 s
Out[9]:
<Graph identifier=my_benchmark (<class 'rdflib.graph.Graph'>)>

We can check that our graph now contains around 32k triples


In [10]:
print("Size of {0} graph is {1} triples".format(graph.identifier, len(graph)))


Size of my_benchmark graph is 32330 triples

Running the benchmarks on the rdflib store

We are going to use BenchManager to do the benchmarks.

We will measure some of the SPARQL queries defined in bench_examples/queries.py. So we first load the queries:


In [11]:
from queries import QUERIES

Now we setup the benchmark with the help of BenchManager


In [12]:
from ktbs_bench_manager import BenchManager
bmgr = BenchManager()

We make our Sleepycat store a context for BenchManager. This is simply a function decorated with @bmgr.context that must yield an rdflib graph:


In [13]:
@bmgr.context
def sleepycat():
    yield graph

In this case it is quite simple because we already created the graph object previously. In more complex cases you may need to open, check, and close the graph inside the context (see here for an example).
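The open-inside-the-context pattern can be sketched with a plain generator (open_graph is a hypothetical callable standing in for creating and opening the Sleepycat-backed graph; the real function would be decorated with @bmgr.context):

```python
def sleepycat_managed(open_graph):
    """Sketch of a managed context in the BenchManager style:
    open the graph, yield it, and always close it afterwards.

    open_graph is a hypothetical stand-in for the rdflib calls
    (Graph(store='Sleepycat', ...) followed by g.open(...)).
    """
    g = open_graph()
    try:
        yield g            # bench functions run while suspended here
    finally:
        g.close()          # runs even if a bench function raises
```

The try/finally guarantees the store is closed whether the benchmarks succeed or fail.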

Now we need to setup the bench functions for our BenchManager by decorating them with @bmgr.bench:


In [14]:
@bmgr.bench
def qall(some_graph):
    some_graph.query(QUERIES['query_all'])
    
@bmgr.bench
def q1(some_graph):
    some_graph.query(QUERIES['q1'])

We could go on and bench all the queries in QUERIES but this is not the purpose of this tutorial.
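If you did want to bench every query, the bench functions can be generated in a loop. One Python pitfall to watch for: a closure created in a loop sees the loop variable's final value, so each query must be pinned with a default argument. A sketch with a plain dict standing in for @bmgr.bench, and placeholder query strings standing in for the imported QUERIES:

```python
# Hypothetical placeholders for the real QUERIES dict from queries.py.
QUERIES = {'q1': 'SELECT 1', 'qall': 'SELECT *'}

benches = {}  # stand-in registry; real code would use @bmgr.bench

for name, query in QUERIES.items():
    def bench(some_graph, _query=query):  # default argument pins the query
        some_graph.query(_query)
    bench.__name__ = 'bench_' + name
    benches[bench.__name__] = bench
```

Without the _query=query default, every generated function would run the last query in the dict.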

We have completed the setup of BenchManager, so we can now run it and output the results in a file:


In [15]:
bmgr.run('/tmp/my_bench.csv')

In the CSV file the columns are the contexts (in this case the only one we have set up, Sleepycat, but we can declare as many as we want) and the rows are the bench functions (in this case q1 and qall). Each cell holds the time (in seconds) taken by one bench function in one bench context.
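Such a results file can be post-processed with the standard csv module. A sketch, assuming the row/column layout just described; the header name 'func', the timings, and the inline CSV text are all hypothetical:

```python
import csv
import io

# Hypothetical results in the layout described above: first column is
# the bench function name, remaining columns are one per context.
raw = """func,sleepycat
q1,0.12
qall,0.34
"""

results = {}
reader = csv.DictReader(io.StringIO(raw))
for row in reader:
    func = row.pop('func')
    # convert each per-context timing from string to float
    results[func] = {ctx: float(t) for ctx, t in row.items()}
```

From there, results['q1']['sleepycat'] gives the timing of one bench function in one context.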

The results for our little bench are:


In [16]:
!cat /tmp/my_bench.csv




Don't forget to close the graph ;)


In [17]:
graph.close()
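If a bench run can fail midway, contextlib.closing guarantees the close() call; it works with any object exposing close(). A sketch with a stand-in graph class (FakeGraph is hypothetical, only the close() contract matters):

```python
from contextlib import closing


class FakeGraph:
    """Stand-in for an rdflib graph; only close() matters here."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


g = FakeGraph()
with closing(g):
    pass  # run the benchmarks here; close() runs even on an exception
```

With the real graph, `with closing(graph): bmgr.run(...)` would make the final close automatic.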

Ending notes

If you plan to do more advanced benchmarks on rdflib stores you should consider:

  • using BenchableGraph from ktbs_bench_manager to have a consistent interface between different stores.

  • using the bench.py utility to run several defined benchmarks.